[None][fix] write per-rank torch profile traces #13536

Open
GavinZhu-GMI wants to merge 1 commit into NVIDIA:main from GavinZhu-GMI:feature/per-rank-torch-profile-trace

Conversation

@GavinZhu-GMI

@GavinZhu-GMI GavinZhu-GMI commented Apr 28, 2026

Summary

PyExecutor reads TLLM_TORCH_PROFILE_TRACE directly, and every rank calls torch_profiler.export_chrome_trace() on the same path. When TP/PP/DP > 1, the concurrent writes interleave and the resulting file fails to parse in Chrome tracing / Perfetto (bad control character / unterminated string at the byte where one rank's output overran another's).
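The failure mode can be illustrated deterministically with synthetic fragments (not real trace bytes): splicing one rank's JSON output into the middle of another's yields a file that no JSON parser accepts, even though each rank's output is valid on its own.

```python
import json

# Two valid, self-contained chrome-trace-style payloads, one per rank.
rank0 = json.dumps({"traceEvents": [{"name": "kernel_a"}]})
rank1 = json.dumps({"traceEvents": [{"name": "kernel_b"}]})

# Simulate rank 1's write landing mid-way through rank 0's write
# to the same file: rank 1's bytes end up inside rank 0's JSON.
interleaved = rank0[: len(rank0) // 2] + rank1 + rank0[len(rank0) // 2 :]

try:
    json.loads(interleaved)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

print(parsed_ok)  # the interleaved bytes do not parse
```

Each payload parses fine in isolation; only the shared-path interleaving corrupts the result, which is why separating the output paths is sufficient.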

Fix: append the rank to the env-provided path before the first use so each rank writes to its own file.

# tensorrt_llm/_torch/pyexecutor/py_executor.py:886
torch_trace_path = os.environ.get(PROFILE_TRACE_ENV_VAR_NAME, None)
if torch_trace_path is not None:
    trace_base, trace_ext = os.path.splitext(torch_trace_path)
    torch_trace_path = f"{trace_base}-rank-{self.global_rank}{trace_ext}"

TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json now produces /tmp/trace-rank-0.json, /tmp/trace-rank-1.json, etc., matching the convention SGLang's scheduler_profiler_mixin already uses: the user supplies a base path and the runtime appends the per-rank suffix automatically.
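The suffixing logic can be sketched as a standalone helper (the function name per_rank_trace_path is illustrative, not part of the patch; the patch inlines this in py_executor.py):

```python
import os

def per_rank_trace_path(base_path: str, rank: int) -> str:
    # Split "/tmp/trace.json" into ("/tmp/trace", ".json") and insert
    # the rank before the extension so each rank writes its own file.
    trace_base, trace_ext = os.path.splitext(base_path)
    return f"{trace_base}-rank-{rank}{trace_ext}"

# The single env-provided base path maps to one path per rank.
paths = [per_rank_trace_path("/tmp/trace.json", r) for r in range(2)]
print(paths)
```

Because os.path.splitext keeps the extension, the result stays a .json file that Chrome tracing and Perfetto will open directly.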

Validation

Reproduced on TRT-LLM 1.3.0rc11 with TP=8 on 8×H200 serving zai-org/GLM-5.1-FP8:

Before patch (single shared path):

$ ls -la /tmp/trtllm-trace.json
-rw-r--r-- 1 dynamo root 26635069 Apr 23 09:22 trtllm-trace.json
$ python3 -c "import json; json.load(open('/tmp/trtllm-trace.json'))"
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 42862 column 6 (char 2452074)

The corrupt byte range contains "name": "void at::native::vectorized_elem truncated mid-string by another rank's opening { for its next event.

After patch (per-rank paths, same env value, same workload):

$ ls -la /tmp/trtllm-trace-rank-*.json
-rw-r--r-- 1 dynamo root 26635133 Apr 24 02:57 /tmp/trtllm-trace-rank-0.json
-rw-r--r-- 1 dynamo root 26633450 Apr 24 02:57 /tmp/trtllm-trace-rank-1.json
-rw-r--r-- 1 dynamo root 26634910 Apr 24 02:57 /tmp/trtllm-trace-rank-2.json
-rw-r--r-- 1 dynamo root 26634923 Apr 24 02:57 /tmp/trtllm-trace-rank-3.json
-rw-r--r-- 1 dynamo root 26633834 Apr 24 02:57 /tmp/trtllm-trace-rank-4.json
-rw-r--r-- 1 dynamo root 26633499 Apr 24 02:57 /tmp/trtllm-trace-rank-5.json
-rw-r--r-- 1 dynamo root 26634943 Apr 24 02:57 /tmp/trtllm-trace-rank-6.json
-rw-r--r-- 1 dynamo root 26633389 Apr 24 02:57 /tmp/trtllm-trace-rank-7.json

$ python3 -c "import json; print(len(json.load(open('/tmp/trtllm-trace-rank-0.json'))['traceEvents']))"
63079

Distinct file sizes confirm the ranks no longer clobber a shared file; rank 0's trace parses with 63,079 events.
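The parse check above can be run over all rank files at once. A minimal sketch (validate_traces is a hypothetical helper, demonstrated here on synthetic files since real traces come from export_chrome_trace):

```python
import glob
import json
import os
import tempfile

def validate_traces(pattern: str) -> dict:
    """json.load every per-rank trace matching pattern; return {path: event count}.

    A corrupt or interleaved file raises JSONDecodeError here, so a clean
    return means every rank's trace is independently parseable.
    """
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            trace = json.load(f)
        counts[path] = len(trace["traceEvents"])
    return counts

# Demo with synthetic per-rank files in a temp dir.
tmp = tempfile.mkdtemp()
for rank in range(4):
    with open(os.path.join(tmp, f"trace-rank-{rank}.json"), "w") as f:
        json.dump({"traceEvents": [{"name": f"op{i}"} for i in range(10 + rank)]}, f)

counts = validate_traces(os.path.join(tmp, "trace-rank-*.json"))
print(counts)
```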

Backwards compatibility

  • TLLM_TORCH_PROFILE_TRACE env name unchanged.
  • TP=1 behavior changes: the file is now <base>-rank-0<ext> instead of <base>. This matches the compromise SGLang made and is the only reliable disambiguation once multiple ranks are involved.

Test plan

  • Smoke: TP=8, 500-token decode, 8 distinct files, all parse with json.load
  • No source patch beyond the 8-line block at py_executor.py:886
  • No env-var change required by callers
  • CI

Reviewers

cc @NVIDIA/trt-llm-torch-runtime-devs @byshiue @xxi-nv — re-opening the multi-rank torch-profiler trace fix from #9022 (which went stale without human review) with a smaller diff and concrete reproducer. Would appreciate eyes here so distributed profiling stops silently corrupting traces.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed torch profiler trace export to generate rank-specific filenames in multi-rank environments, preventing file corruption and ensuring Chrome tracing/Perfetto parsing works correctly when tracing is enabled.

@GavinZhu-GMI GavinZhu-GMI requested a review from a team as a code owner April 28, 2026 01:48
@GavinZhu-GMI GavinZhu-GMI requested a review from achartier April 28, 2026 01:48
@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between be1f6f5 and 511c041.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

📝 Walkthrough

Walkthrough

The change adds rank-specific filename handling to torch profiler trace exports. When tracing is enabled via environment variable, the export filename is rewritten to include the global rank identifier, preventing concurrent writes from multiple ranks to a single file.

Changes

  • Torch Profiler Trace Export (tensorrt_llm/_torch/pyexecutor/py_executor.py): Added rank-specific filename rewriting for torch profiler trace exports to prevent concurrent file write conflicts when TLLM_TORCH_PROFILE_TRACE is enabled.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title '[None][fix] write per-rank torch profile traces' accurately and concisely summarizes the main change: writing per-rank torch profile trace files instead of a single shared file.
  • Description check ✅ Passed: The description covers the problem, solution, validation, backwards compatibility, and test plan. However, it lacks explicit sections for Test Coverage and PR Checklist completion as specified in the template.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.



@achartier
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45826 [ run ] triggered by Bot. Commit: 511c041 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45826 [ run ] completed with state SUCCESS. Commit: 511c041
/LLM/main/L0_MergeRequest_PR pipeline #36010 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 28, 2026
@achartier
Copy link
Copy Markdown
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45952 [ run ] triggered by Bot. Commit: 511c041 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45952 [ run ] completed with state FAILURE. Commit: 511c041
/LLM/main/L0_MergeRequest_PR pipeline #36107 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@GavinZhu-GMI
Author

@tensorrt-cicd Cannot see the exact failure of blossom-ci, can you share the details of the pipeline?

@GavinZhu-GMI GavinZhu-GMI force-pushed the feature/per-rank-torch-profile-trace branch from 511c041 to c9adab8 Compare April 29, 2026 00:56
@achartier
Collaborator

@tensorrt-cicd Cannot see the exact failure of blossom-ci, can you share the details of the pipeline?

CI flakiness, sorry about that. I'll retry.

@achartier
Collaborator

/bot run

@GavinZhu-GMI
Author

/bot run

@achartier
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46145 [ run ] triggered by Bot. Commit: c9adab8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46145 [ run ] completed with state FAILURE. Commit: c9adab8
/LLM/main/L0_MergeRequest_PR pipeline #36271 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@achartier achartier force-pushed the feature/per-rank-torch-profile-trace branch from c9adab8 to f4529ce Compare April 29, 2026 15:51
@achartier
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46170 [ run ] triggered by Bot. Commit: f4529ce Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46170 [ run ] completed with state SUCCESS. Commit: f4529ce
/LLM/main/L0_MergeRequest_PR pipeline #36291 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@achartier
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46227 [ run ] triggered by Bot. Commit: f4529ce Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46227 [ run ] completed with state SUCCESS. Commit: f4529ce
/LLM/main/L0_MergeRequest_PR pipeline #36339 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

PyExecutor reads TLLM_TORCH_PROFILE_TRACE directly and every rank calls
torch_profiler.export_chrome_trace() on the same path. When TP/PP/DP > 1,
the concurrent writes interleave and the resulting file fails to parse
in Chrome tracing / Perfetto (bad control character / unterminated
string at the byte where one rank's output overran another's).

Append the rank to the env-provided path before the first use so each
rank writes to its own file. Matches SGLang's scheduler_profiler_mixin
filename convention: the user supplies a base path, the runtime adds
the per-rank suffix automatically.

Example: TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json now produces
/tmp/trace-rank-0.json, /tmp/trace-rank-1.json, etc.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
@achartier achartier force-pushed the feature/per-rank-torch-profile-trace branch from f4529ce to 9c360e0 Compare April 30, 2026 18:43
@achartier
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46427 [ run ] triggered by Bot. Commit: 9c360e0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46427 [ run ] completed with state ABORTED. Commit: 9c360e0

Link to invocation
